1 Introduction

Within the last two years, approaches using both stochastic and symbolic
techniques have proved adequate to deduce lexical ambiguity resolution rules
with an error rate of less than 3-4%, when trained on moderate-sized (500K
word) corpora of English text (e.g. Church, 1988; Hindle, 1989). The success
of these techniques suggests that much of the grammatical structure of
language may be derived automatically through distributional analysis, an
approach attempted and abandoned in the 1950s.

We describe here two experiments to see how far purely distributional
techniques can be pushed to automatically provide both a set of
part-of-speech tags for English and a grammatical analysis of free English
text. We also discuss the state of a tagged NL corpus to aid such research
(now amounting to 4 million words of hand-corrected part-of-speech tagging).

In the experiment described in Section 2, we have developed a constituent
boundary parsing algorithm which derives an (unlabelled) bracketing given
text annotated for part of speech as input. This method is based on the
hypothesis that constituent boundaries can be extracted from a given
part-of-speech n-gram by analyzing the mutual information values within the
n-gram, extended to a new generalization of the information-theoretic measure
of mutual information. This hypothesis is supported by the performance of an
implementation of this parsing algorithm which determines recursively nested
sentence structure, with an error rate of roughly 2 misplaced boundaries for
test sentences of length 10-15 words, and five misplaced boundaries for
sentences of 15-30 tokens. To combat a limited set of specific circumstances
in which the hypothesis fails, we use a small (4 rule, 8 symbol) distituent
grammar, which indicates when two parts of speech cannot remain in the same
constituent.

In another experiment, described in Section 3, we investigate whether a
distributional analysis can discover a part-of-speech tag set which might
prove adequate to support experiments like that discussed above. We have
developed a similarity measure which accurately clusters closed-class lexical
items of the same grammatical category, excepting words which are ambiguous
between multiple parts of speech.

¹This work was partially supported by DARPA grant No. N0014-85-K0018, by
DARPA and AFOSR jointly under grant No. AFOSR-90-0066, and by ARO grant
No. DAAL 03-89-C0031 PRI. Thanks to Ken Church, Stuart Shieber, Max Mintz,
Aravind Joshi, Lila Gleitman and Tom Veatch for their valued suggestions
and discussion.

2  A  Mutual  Information  Parser 

2.1  Introduction 

In this section, we characterize a constituent boundary parsing algorithm,
using an information-theoretic measure called generalized mutual information,
which serves as an alternative to traditional grammar-based parsing methods.
We view part-of-speech sequences as stochastic events and apply probabilistic
models to these events. Our hypothesis is that constituent boundaries, or
“distituents,” can be extracted from a sequence of n categories, or an
n-gram, by analyzing the mutual information values of the part-of-speech
sequences within that n-gram. In particular, we demonstrate that the
generalized mutual information statistic, an extension of the bigram
(pairwise) mutual information of two events into n-space, acts as a viable
measure of continuity in a sentence.

This hypothesis assumes that, given any constituent n-gram
$a_1 a_2 \ldots a_n$, the probability of that constituent occurring is
usually significantly higher than the probability of
$a_1 a_2 \ldots a_n a_{n+1}$ occurring. This is true, in general, because
most constituents appear in a variety of contexts. Once a constituent is
detected, it is usually very difficult to predict what part-of-speech will
come next. As it turns out, however, there are cases in which this assumption
is not valid, but only a handful of these cases are responsible for a
majority of the errors made by the parser. To deal with these cases, our
algorithm includes what we will call a distituent grammar: a list of tag
pairs which cannot be adjacent within a constituent. One such pair is noun
prep, since English does not allow a constituent consisting of a noun
followed by a preposition. Notice that the nominal head of a noun phrase may
be followed by a prepositional phrase; in the context of distituent parsing,
once a sequence of tags, such as (prep noun), is grouped as a constituent, it
is considered as a unit. Our current distituent grammar consists of four
rules of two tokens each.

Our current implementation of this parsing algorithm determines a recursive
unlabeled bracketing of unrestricted English text. The generalized mutual
information statistic and the distituent grammar combine to parse sentences
with, on average, two errors per sentence for sentences of 15 words or less,
and five errors per sentence for sentences of 30 words or less (based on
sentences from a reserved test subset of the Tagged Brown Corpus, see
footnote 2). Many of the errors on longer sentences result from conjunctions,
which are traditionally troublesome for grammar-based algorithms as well.
Further, this parsing technique is reasonably efficient, parsing a 35,000
word corpus in under 10 minutes on a Sun 4/280.

While many stochastic approaches to natural language processing that utilize
frequencies to estimate probabilities suffer from sparse data, sparse data is
not a concern in the domain of our algorithm. Sparse data usually results
from the infrequency of word sequences in a corpus. The statistics extracted
from our training corpus are based on tag n-grams for a set of 64 tags, not
word n-grams.² The corpus size is sufficiently large that enough tag n-grams
occur with sufficient frequency to permit accurate estimates of their
probabilities. Therefore, the kinds of estimation methods of (n + 1)-gram
probabilities using n-gram probabilities discussed in Katz (1987) and
Church and Gale (1989) are not needed.

This line of research was motivated by a series of successful applications
of mutual information statistics to other problems in natural language
processing. In the last decade, research in speech recognition (Jelinek
1985), noun classification (Hindle 1988), predicate argument relations
(Church and Hanks 1989), and other areas has shown that mutual information
statistics provide a wealth of information for solving these problems.


2.2  Mutual  Information  Statistics 

The mutual information statistic (Fano 1961) is a measure of the
interdependence of two signals in a message. It is a function of the
probabilities of the two events:

$$\mathcal{MI}(x, y) = \log \frac{\mathcal{P}_{X,Y}(x, y)}{\mathcal{P}_X(x)\,\mathcal{P}_Y(y)}$$

In this paper, the events x and y will be part-of-speech n-grams (instead of
single parts-of-speech, as in some earlier work).
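As a minimal sketch of the bigram base case, the statistic can be estimated
from relative frequencies in a tagged sequence. The function name
`bigram_mutual_information`, the toy tag sequence, and the choice of base-2
logarithms are assumptions made for illustration; the text does not specify a
log base or an estimation procedure beyond frequency counts.

```python
import math
from collections import Counter

def bigram_mutual_information(tags):
    """Estimate MI(x, y) = log P(x, y) / (P(x) P(y)) for each adjacent tag pair.

    Probabilities are plain relative frequencies; log base 2 is an assumption.
    """
    unigrams = Counter(tags)
    bigrams = Counter(zip(tags, tags[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    mi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi

# Toy tag sequence (hypothetical, not drawn from the Brown Corpus).
tags = "det noun verb det noun prep det noun".split()
mi = bigram_mutual_information(tags)
```

A positive value means the pair co-occurs more often than independence would
predict; a negative value means it co-occurs less often.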

Experiments that we will not report here show that simple mutual information
statistics computed on n-gram sequences are not sufficient for the task at
hand. Instead, we have moved to a statistic which we will call “generalized
mutual information,” because it is a generalization of the mutual information
of part-of-speech bigrams into n-space. Generalized mutual information uses
the context on both sides of adjacent parts-of-speech to determine a measure
of their distituency in a given sentence.

While our distituent parsing technique relies on generalized mutual
information of n-grams, the foundations of the technique will be illustrated
with the base case of simple mutual information over the space of bigrams for
expository convenience.

²The corpus we use to train our parser is the Tagged Brown Corpus (Francis
and Kucera, 1982). Ninety percent of the corpus is used for training the
parser, and the other ten percent is used for testing. The tag set used is a
subset of the Brown Corpus tag set.

2.2.1  Generalized  Mutual  Information 
In applying the concept of mutual information to the analysis of sentences,
the interdependence of part-of-speech n-grams (sequences of n
parts-of-speech) must be considered. Thus, we consider an n-gram as a bigram
of an $n_1$-gram and an $n_2$-gram, where $n_1 + n_2 = n$. The mutual
information of this bigram is

$$\mathcal{MI}[n_1\text{-gram},\, n_2\text{-gram}] = \log \frac{\mathcal{P}[n\text{-gram}]}{\mathcal{P}[n_1\text{-gram}]\,\mathcal{P}[n_2\text{-gram}]}$$

Notice that there are (n − 1) ways of partitioning an n-gram. Thus, for each
n-gram, there is an (n − 1)-vector of mutual information values. For a given
n-gram $x_1 \ldots x_n$, we can define the mutual information values of x by:

$$\mathcal{MI}^n_k(x_1 \ldots x_n) = \mathcal{MI}(x_1 \ldots x_k,\; x_{k+1} \ldots x_n) = \log \frac{\mathcal{P}(x_1 \ldots x_n)}{\mathcal{P}(x_1 \ldots x_k)\,\mathcal{P}(x_{k+1} \ldots x_n)}$$

where $1 \le k < n$.

Notice that, in the above equation, for each $\mathcal{MI}^n_k(x)$, the
numerator, $\mathcal{P}(x_1 \ldots x_n)$, remains the same while the
denominator, $\mathcal{P}(x_1 \ldots x_k)\,\mathcal{P}(x_{k+1} \ldots x_n)$,
depends on k. Thus, the mutual information value achieves its minimum at the
point where the denominator is maximized. The empirical claim to be tested in
this paper is that the minimum is achieved when the two components of this
n-gram are in two different constituents, i.e. when $x_k x_{k+1}$ is a
distituent. Our experiments show that this claim is largely true with a few
interesting exceptions.
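The vector of partition values and its minimum can be sketched as follows.
This is an illustration under stated assumptions: the names `mi_vector` and
`weakest_split` are hypothetical, n-grams are represented as tuples of tags,
the probabilities in the example are invented, and base-2 logarithms are used.

```python
import math

def mi_vector(ngram, prob):
    """Return [MI_1, ..., MI_{n-1}] for an n-gram given as a tuple of tags.

    MI_k = log P(x_1..x_n) / (P(x_1..x_k) * P(x_{k+1}..x_n)).
    `prob` maps tag tuples to estimated probabilities.
    """
    joint = prob[ngram]
    return [
        math.log2(joint / (prob[ngram[:k]] * prob[ngram[k:]]))
        for k in range(1, len(ngram))
    ]

def weakest_split(ngram, prob):
    """Return (k, values), where k is the split position with minimal MI_k."""
    values = mi_vector(ngram, prob)
    k = min(range(len(values)), key=values.__getitem__) + 1
    return k, values

# Invented probabilities, for illustration only.
prob = {
    ("det", "noun", "verb"): 0.02,
    ("det",): 0.1, ("noun", "verb"): 0.05,
    ("det", "noun"): 0.08, ("verb",): 0.1,
}
k, values = weakest_split(("det", "noun", "verb"), prob)
```

With these numbers the minimum falls at k = 2, between noun and verb, so the
candidate distituent is the pair that separates (det noun) from the verb.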

A straightforward approach would assign each potential distituent a single
real number corresponding to the extent to which its context suggests it is a
distituent. But the simple extension of bigram mutual information assigns
each potential distituent a number for each n-gram of which it is a part. The
question remains how to combine these numbers in order to achieve a valid
measure of distituency.

Our investigations revealed that a useful way to combine mutual information
values is, for each possible distituent xy, to take a weighted sum of the
mutual information values of all possible pairings of n-grams ending with x
and n-grams beginning with y, within a fixed size window. So, for a window of
size w = 4, given the context $x_1 x_2 x_3 x_4$, the generalized mutual
information of $x_2 x_3$ is

$$\mathcal{GMI}_4(x_1 x_2,\, x_3 x_4) = k_1\,\mathcal{MI}(x_2, x_3) + k_2\,\mathcal{MI}(x_2, x_3 x_4) + k_3\,\mathcal{MI}(x_1 x_2, x_3) + k_4\,\mathcal{MI}(x_1 x_2, x_3 x_4)$$

which, taking each weight $k_i$ as 1, is equivalent to

$$\log \left( \frac{\mathcal{P}[x_2 x_3]\,\mathcal{P}[x_2 x_3 x_4]\,\mathcal{P}[x_1 x_2 x_3]\,\mathcal{P}[x_1 x_2 x_3 x_4]}{\left(\mathcal{P}[x_2]\,\mathcal{P}[x_3]\,\mathcal{P}[x_1 x_2]\,\mathcal{P}[x_3 x_4]\right)^2} \right)$$

In general, the generalized mutual information of any given bigram xy in the
context $x_1 \ldots x_{i-1}\, x\, y\, y_1 \ldots y_{j-1}$ is equivalent to

$$\log \left( \frac{\prod_{X \text{ crosses } xy} \mathcal{P}[X]}{\prod_{X \text{ does not cross } xy} \mathcal{P}[X]^{(i+j)/2}} \right)$$

This formula behaves in a manner consistent with one’s expectation of a
generalized mutual information statistic. It incorporates all of the mutual
information data within the given window in a symmetric manner. Since it is
the sum of bigram mutual information values, its behavior parallels that of
bigram mutual information.

The standard deviation of the values of the bigram mutual information vector
of an n-gram is a valid measure of the confidence of these values. Since
distituency is indicated by mutual information minima, we use the reciprocal
of the standard deviation as a weighting function.
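The weighted sum over pairings can be sketched as below. This is a sketch
under assumptions, not the authors' implementation: the names `mi` and `gmi`
are hypothetical, the probabilities are invented, and the weights default to
1 because the text does not spell out exactly how the reciprocal standard
deviation enters each term (a caller could supply it through `weight`). The
example checks the sum against the closed-form product for a window of four.

```python
import math

def mi(left, right, prob):
    """MI between two adjacent tag n-grams, each given as a tuple."""
    return math.log2(prob[left + right] / (prob[left] * prob[right]))

def gmi(tags, p, prob, half_window=2, weight=lambda left, right: 1.0):
    """Weighted sum of MI over every n-gram ending at tags[p-1] paired with
    every n-gram starting at tags[p], each at most `half_window` tags long."""
    total = 0.0
    for i in range(1, min(half_window, p) + 1):
        left = tuple(tags[p - i:p])
        for j in range(1, min(half_window, len(tags) - p) + 1):
            right = tuple(tags[p:p + j])
            total += weight(left, right) * mi(left, right, prob)
    return total

# Invented probabilities for the context x1 x2 x3 x4 (illustration only).
tags = ("x1", "x2", "x3", "x4")
prob = {
    ("x2",): 0.2, ("x3",): 0.25, ("x1", "x2"): 0.1, ("x3", "x4"): 0.125,
    ("x2", "x3"): 0.1, ("x2", "x3", "x4"): 0.05,
    ("x1", "x2", "x3"): 0.05, ("x1", "x2", "x3", "x4"): 0.025,
}
score = gmi(tags, 2, prob)  # boundary between x2 and x3

# Closed form with unit weights: crossing n-grams over squared non-crossing ones.
closed_form = math.log2(
    (prob[("x2", "x3")] * prob[("x2", "x3", "x4")]
     * prob[("x1", "x2", "x3")] * prob[("x1", "x2", "x3", "x4")])
    / (prob[("x2",)] * prob[("x3",)] * prob[("x1", "x2")] * prob[("x3", "x4")]) ** 2
)
```

Keeping the weight as a parameter leaves the choice of weighting function
open while making the symmetric treatment of left and right context explicit.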

2.3  The  Parsing  Algorithm 
The generalized mutual information statistic is the most theoretically
significant aspect of the mutual information parser. However, if it were used
in a completely straightforward way, it would perform rather poorly on
sentences which exceed the size of the maximum word window. Generalized
mutual information is a local measure which can only be compared in a
meaningful way with other values which are less than a word window away. In
fact, the further apart two potential distituents are, the less meaningful
the comparison between their corresponding $\mathcal{GMI}$ values. Thus, it
is necessary to compensate for the local nature of this measure
algorithmically.

He  directed  the  cortege  of  autos  to  the  dunes 
near  Santa  Monica. 

Figure  1:  Sample  sentence  from  the  Brown  Corpus 

In Magerman and Marcus (1990) we describe the parsing algorithm in detail,
and trace the parsing of a sample sentence (Figure 1) selected from the
section of the Tagged Brown Corpus which was not used for training the
parser. The sample sentence is viewed by the parser as a tag sequence, since
the words in the sentence are not accounted for in the parser’s statistical
model.

A bigram mutual information value vector and its standard deviation are
calculated for each n-gram in the sentence, where 2 ≤ n ≤ 10. If the
frequency of an n-gram is below a certain threshold (< 10, determined
experimentally), then the mutual information values are all assumed to be 1,
indicating that no information is given by that n-gram. These values are
calculated once for each sentence and referenced frequently in the parse
process.


Distituent    Pass 1      DC    Pass 2   Pass 3
pro verb        3.28    3.28      3.28     3.28
verb det        3.13    3.13      3.13     3.13
det noun       11.18   11.18
noun prep      11.14      −∞      8.18
prep noun       1.20    1.20
noun prep       7.41      −∞      3.91     2.45
prep det       16.89   16.89     10.83
det noun       16.43   16.43
noun prep      12.73      −∞      7.64     4.13
prep noun       7.36    7.36

Figure 2: Parse node table for sample sentence


Next, a parse node is allocated for each tag in the sentence. A generalized
mutual information value is computed for each possible distituent, i.e. each
pair of parse nodes, using the previously calculated bigram mutual
information values. The resulting parse node table for the sample sentence is
indicated by Pass 1 in the parse node table (Figure 2).

At this point, the algorithm deviates from what one might expect. As a
preprocessing step, the distituent grammar is invoked to flag any known
distituents by replacing their $\mathcal{GMI}$ value with $-\infty$. The
results of this phase are indicated in the DC column in the parse node table.
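The flagging step can be sketched as below, using the Pass 1 values from
Figure 2. The function name `apply_distituent_grammar` is hypothetical, and
the grammar here contains only the noun prep rule, since that is the only one
of the four rules the text names.

```python
import math

# Only the rule named in the text; the other three rules are not listed there.
DISTITUENT_GRAMMAR = {("noun", "prep")}

def apply_distituent_grammar(tags, gmi_values, grammar=DISTITUENT_GRAMMAR):
    """Replace the score of every known distituent with -infinity.

    gmi_values[i] scores the boundary between tags[i] and tags[i + 1].
    """
    return [
        -math.inf if (tags[i], tags[i + 1]) in grammar else value
        for i, value in enumerate(gmi_values)
    ]

# Tag sequence of the sample sentence, read off the rows of Figure 2.
tags = "pro verb det noun prep noun prep det noun prep noun".split()
pass1 = [3.28, 3.13, 11.18, 11.14, 1.20, 7.41, 16.89, 16.43, 12.73, 7.36]
dc = apply_distituent_grammar(tags, pass1)
```

With this single rule the three noun prep boundaries are flagged and every
other score is left unchanged.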

The first w tags in the sentence are processed using an n-ary-branching
recursive function which branches at the minimum $\mathcal{GMI}$ value of the
given window, with marginal differences between $\mathcal{GMI}$ values
ignored. The local minima at which branching occurs in each pass of the parse
are indicated by italics in the parse node table.

Instead of using this tree in its entirety, only the nodes in the leftmost
and rightmost constituent leaves are pruned. The rest of the nodes in the
window are thrown back into the pool of nodes. The algorithm is applied again
to the leftmost and rightmost w remaining tags until no more tags remain. The
first pass of the parser is complete, and the sentence has been partitioned
into constituents (Figure 3).

(He)  (directed)  (the  cortege)  (of  autos) 

(to)  (the  dunes)  (near  Santa  Monica) 

Figure  3:  Constituent  structure  after  Pass  1 
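The branching step alone can be sketched as a recursive split at the lowest
boundary score. This is a simplified illustration, not the windowed multi-pass
procedure described above: it omits the pruning of leftmost and rightmost
leaves and the repeated passes. The name `bracket` and the `tolerance`
parameter (standing in for "marginal differences ignored") are assumptions,
and the tolerance value 0.2 is invented.

```python
def bracket(words, scores, tolerance=0.0):
    """Recursively split `words` at the lowest-scoring boundaries.

    scores[i] is the score of the boundary between words[i] and words[i + 1].
    Every boundary within `tolerance` of the minimum is cut at the same level,
    which yields n-ary rather than strictly binary branching.
    """
    if len(words) == 1:
        return list(words)
    cutoff = min(scores) + tolerance
    cuts = [i for i, s in enumerate(scores) if s <= cutoff]
    parts, start = [], 0
    for cut in cuts:
        parts.append((words[start:cut + 1], scores[start:cut]))
        start = cut + 1
    parts.append((words[start:], scores[start:]))
    return [ws[0] if len(ws) == 1 else bracket(ws, ss, tolerance)
            for ws, ss in parts]

# First four words of the sample sentence with their Pass 1 scores.
tree = bracket(["He", "directed", "the", "cortege"], [3.28, 3.13, 11.18],
               tolerance=0.2)
```

On this window the two low scores are treated as equivalent minima, so the
result groups "the cortege" and leaves "He" and "directed" as separate units.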

The algorithm terminates when no new structure has been ascertained on a
pass, or when the lengths of two adjacent constituents sum to greater than w.
After two more passes of the algorithm, the sample sentence is partitioned
into two adjacent constituents, and thus the algorithm terminates, with the
result in Figure 4. In this example, the prepositional phrase “near Santa
Monica” is not attached to the noun phrase “the dunes” as it should be;
therefore, the parser output for the sample sentence has one error.

(He  (directed  ((the  cortege)  (of  autos))) 

((to  (the  dunes)) 

(near  Santa  Monica))) 

Figure  4:  Resulting  constituent  structure  after  Pass  3 

2.4  Results 

A careful evaluation of this parser, like any other, requires some “gold
standard” against which to judge its output. Soon, we will be able to use the
skeletal parsing of the Penn Treebank we are about to begin producing to
evaluate this work (although evaluating this parser against materials which
we ourselves provide is admittedly problematic). For the moment, we have
simply graded the output of the parser by hand ourselves. While the error
rate for short sentences (15 words or less) with simple constructs is
accurate, the error rate for longer sentences is more of an approximation
than a rigorous value.

On unconstrained free text from a reserved test corpus, the parser averages
about two errors per sentence for sentences under 15 words in length. On
sentences between 16 and 30 tokens in length, it averages between 5 and 6
errors per sentence. In nearly all of these longer sentences and many of the
shorter ones, at least one of the errors is caused by confusion about
conjuncts.

One interesting possibility is to use the generalized mutual information
statistic to extract a grammar from a corpus. Since the statistic is
consistent, and its window can span more than two constituents, it could be
used to find constituent units which occur with the same distribution in
similar contexts. Given the results of the next section, it may well be
possible to use automatic techniques to determine a first approximation to
the set of word classes of a language, given only a large corpus of text, and
then to extract a grammar for that set of word classes. Such a goal is very
difficult, of course, but we believe that it is worth pursuing. In the end,
we believe that this, like many problems in natural language processing,
cannot be solved efficiently by grammar-based algorithms nor accurately by
purely stochastic algorithms. We believe strongly that the solution to some
of these problems may well be a combination of both approaches.

3  Discovering  the  Word  Classes 
of  a  Language 
3.1  Introduction 

As we ask immediately above, to what extent is it possible to discover by
some kind of distributional analysis the kind of part-of-speech tags upon
which our mutual information parser depends? In this section, we examine the
possibility of using distributional analysis to discover the feature set and
word classes of a language.³ It is based upon the following idea, a variant
of the distributional analysis methods from Structural Linguistics (Harris
51, Harris 68): features license the distributional behavior of lexical
items. At the two extremes, a word with no features would not be licensed to
appear in any context at all, whereas a word marked with all features of the
language would be licensed to appear in every possible context.

3.2  The  Algorithm 

The feature discovery system works as follows. First, a large amount of text
is examined to discover the frequency of occurrence of different bigrams.⁴
Based upon this data, the system groups words into classes. Two words are in
the same class if they can occur in the same contexts. In order to determine
whether x and y belong to the same class, the system first examines all
bigrams containing x. If for a high percentage of these bigrams, the
corresponding bigram with y substituted for x exists in the corpus, then it
is likely that y has all of the features that x has (and maybe more). If upon
examining the bigrams containing y the system is able to conclude that x also
has all of the features that y has, it then concludes that x and y are in the
same class.

For every pair of bigrams, the system must determine how much to weigh the
presence of those bigrams as evidence that two words have features in common.
For instance, assume: (a) the bigram the boy appears many times in the corpus
being analyzed, while the sits never occurs. Also assume: (b) the bigram boy
the (as in the boy the girl kissed . . .) occurs once and sits the never
occurs. Case (a) should be much stronger evidence that boy and sits are not
in the same class than case (b). For each bigram αx occurring in the corpus,
evidence offered by the presence (or absence) of the bigram αy is scaled by
the frequency of αx in the text divided by the total number of bigrams
containing x on their right hand side. Since the end-of-phrase position is
less restrictive, we would expect each bigram involving this position and the
word to the right of it to occur less frequently than bigrams of two
phrase-internal words. By weighing the evidence, bigrams which cross
boundaries will be weighed less than those which do not.

3.2.1  The  Specifics 

The function implies(x,y) calculates the likelihood (on a scale of [0..1])
that word y contains all of the features of word x. For example, we would
expect the value of implies('a', 'the') to be close to 1, since 'the' can
occur in any context which 'a' can occur in. Note that: implies(x,y) ∧
implies(y,x) iff x and y are in the same class.

$$\mathit{implies}(x, y) = \frac{\mathit{leftimply}(x, y) + \mathit{rightimply}(x, y)}{2}$$

The function leftimply(x,y) is the likelihood (on a scale of [0..1]) that y
contains all of the features of x, where this likelihood is derived from
looking at bigrams of the form: xα. rightimply(x,y) derives the likelihood by
examining all bigrams of the form: αx. bothoccur(α,β) is 1 if both bigrams α
and β occur in the corpus, and β occurs with a frequency at least
1/THRESHOLD of that of α, for some THRESHOLD.⁵ bothoccur accounts for the
fact that we cannot expect the distribution of two equivalent words over
bigrams to be precisely the same, but we would not expect the two
distributions to be too dissimilar either.

³We consider the set of features of a particular language to be all
attributes which that language makes reference to in its syntax.

⁴For this experiment, we take a very local view of context, only considering
bigrams.

$$\mathit{leftimply}(x, y) = \frac{\sum_{z} \mathit{percentage}_{left}(x, z) \cdot \mathit{bothoccur}_{left}(xz, yz)}{\sum_{z} \mathit{percentage}_{left}(x, z)}$$

$$\mathit{rightimply}(x, y) = \frac{\sum_{z} \mathit{percentage}_{right}(x, z) \cdot \mathit{bothoccur}_{right}(zx, zy)}{\sum_{z} \mathit{percentage}_{right}(x, z)}$$

$$\mathit{bothoccur}_{left}(ab, cd) = \begin{cases} 1 & \text{if bigrams } ab \text{ and } cd \text{ appear in the corpus and } \mathit{percentage}_{left}(c, d) \ge \frac{1}{\mathrm{THRESHOLD}} \cdot \mathit{percentage}_{left}(a, b) \\ 0 & \text{otherwise} \end{cases}$$

When computing the relation between x and all other words, we use the
following function, percentage, to weigh the evidence (as described above),
where count(ab) is the number of occurrences of the bigram ab in the corpus,
and numright(x) (numleft(x)) is the total number of bigrams with x on their
right hand side (left hand side).

$$\mathit{percentage}_{left}(x, y) = \frac{\mathit{count}(xy)}{\mathit{numleft}(x)}$$

$$\mathit{percentage}_{right}(x, y) = \frac{\mathit{count}(yx)}{\mathit{numright}(x)}$$

For all pairs of words, x and y, we calculate implies(x,y) and implies(y,x).
We can then find word classes in the following way. We first determine a
threshold value, where a stronger value will result in more specific classes.
Then, for each word x, we find all words y such that both implies(x,y) and
implies(y,x) are greater than the threshold. We next take the transitive
closure of pairs of sets with nonempty intersection over all of these sets,
and the result is a set of sets, where each set is a word class. Classes of
different degrees of specificity are found by varying the degree of
similarity between distributions needed to conclude that two words are in the
same class. If a high degree of similarity is required, all words in a class
will have the same features. If a lower degree of similarity is required,
then words in a class must have most, but not all, of the same features.

3.3 The Experiment

To test the algorithm discussed above, we ran the following experiment.
First, the number of occurrences of each bigram in the corpus was determined.
Statistics on distribution were determined by examining the complete Brown
Corpus (Francis 82), where infrequently occurring open-class words were
replaced with their part-of-speech tag. We then ran the program on a group of
words including all closed-class words which occurred more than 250 times in
the corpus, and the most frequently occurring open-class words. Note that the
system attempted to determine the relations between these words; this does
not mean that it only considered bigrams αβ, where both α and β were from
this list of words which were being partitioned. All bigrams which occurred
more than 5 times were considered in the distributional analysis.
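The implication scores and the class-finding step of Section 3.2.1 can be
sketched as follows. Several details are assumptions rather than the authors'
implementation: the names `make_implies` and `word_classes`, the averaging of
the left and right scores, the normalisation of each score by the summed
percentages, the use of union-find to realise the transitive closure, the
cutoff of 0.9, and the toy corpus. The frequency floor on bigrams used in the
experiment is omitted.

```python
from collections import Counter
from itertools import combinations

def make_implies(counts, threshold=6):
    """Build implies(x, y) from a Counter of (left_word, right_word) bigrams."""
    numleft, numright = Counter(), Counter()
    for (a, b), c in counts.items():
        numleft[a] += c    # bigrams with a on their left hand side
        numright[b] += c   # bigrams with b on their right hand side
    vocab = {w for pair in counts for w in pair}

    def pct_left(x, z):    # count(xz) / numleft(x)
        return counts[(x, z)] / numleft[x] if numleft[x] else 0.0

    def pct_right(x, z):   # count(zx) / numright(x)
        return counts[(z, x)] / numright[x] if numright[x] else 0.0

    def one_side(x, y, pct):
        num = den = 0.0
        for z in vocab:
            px, py = pct(x, z), pct(y, z)
            if px == 0.0:
                continue
            den += px
            # bothoccur: y's bigram exists and is at least 1/threshold as frequent
            if py > 0.0 and py >= px / threshold:
                num += px
        return num / den if den else 0.0

    def implies(x, y):
        return (one_side(x, y, pct_left) + one_side(x, y, pct_right)) / 2

    return implies

def word_classes(words, implies, cutoff):
    """Merge words whose implication holds in both directions (union-find)."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for x, y in combinations(words, 2):
        if implies(x, y) > cutoff and implies(y, x) > cutoff:
            parent[find(x)] = find(y)
    classes = {}
    for w in words:
        classes.setdefault(find(w), set()).add(w)
    return list(classes.values())

# Toy corpus, invented for illustration.
tokens = ("the boy sits . a boy sits . the girl sits . a girl runs . "
          "the dog runs . a dog sits .").split()
counts = Counter(zip(tokens, tokens[1:]))
implies = make_implies(counts)
classes = word_classes(sorted(set(tokens)), implies, cutoff=0.9)
```

On this toy corpus the determiners fall into one class, while "boy" stays
apart from "girl" and "dog" because it never precedes "runs", which shows how
the requirement that implication hold in both directions separates words with
overlapping but unequal distributions.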


3.4  Analysis  of  the  Experiment 
The program successfully partitioned words into word classes.⁶ In addition,
it was able to find more fine-grained features. Among the features found
were: [possessive-pronoun], [singular-determiner], [definite-determiner],
[wh-adjunct] and [pronoun+be]. A description of some of the word classes the
program discovered can be found in Appendix A.


3.5  The  Psychological  Plausibility  of 
Distributional  Analysis 

If a child does not know a priori what features are used in her language,
there are two ways in which she can acquire this information: by using either
syntactic or semantic cues. The child could use syntactic cues such as the
method of distributional analysis described in this paper. The child might
also rely upon semantic cues. There is evidence that children use syntactic
rather than semantic cues in classifying words. Peter Gordon (Gordon 85) ran
an experiment where the child was presented with an object which was given a
made up name. For objects with semantic properties of count nouns (mass
nouns), the word was used in lexical environments which only mass nouns
(count nouns) are permitted to be in. Gordon showed that the children
overwhelmingly used the distributional cues and not the semantic cues in
classifying the words. Virginia Gathercole (Gathercole 85) found that
“children do not approach the co-occurrence conditions of much and many with
various nouns from a semantic point of view, but rather from a
morphosyntactic or surface-distributional one.” Yonata Levy (Levy 83)
examined the mistakes young children make in classifying words. The mistakes
made were not those one would expect the child to make if she were using
semantic cues to classify words.

⁵In the experiments we ran, we found THRESHOLD = 6 to give the best results.
This value was found by examining the values of implication found between
the, a and an.

⁶One exception was the class of pronouns. Since [+nominative] and
[-nominative] pronouns do not have similar distribution, they were not found
to be in the same class.

4 Penn Treebank

In this section, we report some recent performance measures of the Penn
Treebank Project.

To date, we have tagged over 4 million words by part of speech (cf. Table 1).
We are tagging this material with a much simpler tagset than used by previous
projects, as discussed at the Oct. 1989 DARPA Workshop. The material is first
processed using Ken Church’s tagger (Church 1988), which labels it as if it
were Brown Corpus material, and then is mapped to our tagset by a SED script.
Because of fundamental differences in tagging strategy between the Penn
Treebank Project and the Brown project, the resulting mapping is about 9%
inaccurate, given the tagging guidelines of the Penn Treebank project (as
given in 40 pages of explicit tagging guidelines). This material is then
hand-corrected by our annotators; the result is consistent within annotators
to about 3% (cf. Table 3), and correct (again, given our tagging guidelines)
to about 2.5% (cf. Table 2), as will be discussed below. We intend to use
this material to retrain Church’s tagger, which we then believe will be
accurate to less than 3% error rate. We will then adjudicate between the
output of this new tagger, run on the same corpus, and the previously tagged
material. We believe that this will yield well below 1% error, at an
additional cost of between 5 and 10 minutes per 1000 words of material. To
provide exceptionally accurate bigram frequency evidence for retraining the
automatic tagger we are using, two subcorpora (Library of America, DOE
abstracts) were tagged twice by different annotators, and the Library of
America texts were adjudicated by a third annotator, yielding ~160,000 words
tagged with an accuracy estimated to exceed 99.5%.

                      Raw no.      Times     Total no.
                      of words     tagged    of words
Brown Corpus          1,159,381    1         1,159,381
Library of America      159,267    2           318,534
DOE abstracts           199,928    2           399,856
Dow Jones Corpus      2,644,618    1         2,644,618
Grand total           4,163,194              4,522,389

Table 1: Number of words tagged

Table 2 provides an estimate of error rate for part-of-speech annotation
based on the tagging of the sample described above. Error rate is measured in
terms of the number of disagreements with a benchmark version of the sample
prepared by Beatrice Santorini. We have also estimated the rate of
inter-annotator inconsistency based on the tagging of the sample described
above (cf. Table 3). Inconsistency is measured in terms of the proportion of
disagreements of each of the annotators with each other over the total number
of words in the test corpus (5,425 words).

Tagger    No. of errors    Error rate (%)
RF        105              1.9
CH        151              2.8
MAM       127              2.3
MP        158              2.9
MW        136              2.5
Mean      135              2.5

Table 2: Error rates

          CH      MAM     MP      MW
RF        2.6%    3.5%    3.2%    3.0%
CH        -       2.9%    3.9%    3.7%
MAM       -       -       3.3%    2.7%
MP        -       -       -       2.8%

Mean: 3.2%

Table 3: Inter-annotator inconsistency

Table 4 provides an estimate of speed of part-of-speech annotation for a set
of ten randomly selected texts from the Dow Jones Corpus (containing a total
of 5,425 words), corrected by each of our annotators. The annotators were
thoroughly familiar with the genre, having spent over three months
immediately prior to the experiment correcting texts from the same Corpus.
Given that the average productivity overall of our project has been between
3,000 and 3,500 words per hour of time billed by our annotators, it appears
that our strategy of hiring annotators for no more than 3 hours a day has
proven to be quite successful.

Tagger    Time            Words       Minutes per
          (in minutes)    per hour    1,000 words
RF         68             4,804       12.5
CH         79             4,129       14.5
MAM        57             5,751       10.4
MP         74             4,423       13.3
MW        100             3,268       18.3
Mean       76             4,283       14.0

Table 4: Speed of part-of-speech annotation

Finally, the summary statistics in Table 5 provide an estimate of improvement
of annotation speed as a function of familiarity with genre. We compared the
annotators’ speed on two samples of the Brown Corpus (10 texts) and the Dow
Jones Corpus (100 texts). We examined the first and last samples of each
genre that the annotators tagged; in each case, more than two months of
experience lay between the samples.

                        Words       Minutes per
                        per hour    1,000 words
Early    Brown          2,816       21.3
         Dow Jones      1,711       35.1
         Mean           2,621       22.9
Late     Brown          3,483       17.2
         Dow Jones      3,641       16.5
         Mean           3,511       17.1
Improvement             34%         25%

Table 5: Speed as function of familiarity with genre
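The totals in Table 1 and the improvement figures in Table 5 can be checked
with a few lines of arithmetic. The sketch below assumes that the improvement
row reports the relative change between the early and late means, which is an
interpretation rather than something the text states; the variable and
function names are illustrative.

```python
# Table 1: subcorpus -> (raw number of words, times tagged)
table1 = {
    "Brown Corpus": (1_159_381, 1),
    "Library of America": (159_267, 2),
    "DOE abstracts": (199_928, 2),
    "Dow Jones Corpus": (2_644_618, 1),
}
raw_total = sum(raw for raw, _ in table1.values())
tagged_total = sum(raw * times for raw, times in table1.values())

def minutes_per_thousand(words_per_hour):
    """Convert a words-per-hour rate into minutes per 1,000 words."""
    return 60_000 / words_per_hour

# Table 5 means: early versus late samples.
early_wph, late_wph = 2_621, 3_511
early_mpk, late_mpk = 22.9, 17.1
speed_gain = (late_wph - early_wph) / early_wph    # relative gain in words per hour
time_saving = (early_mpk - late_mpk) / early_mpk   # relative drop in minutes per 1,000 words
```

The two columns of Table 5 are reciprocal views of the same rate, which is
why the relative gain in words per hour differs from the relative saving in
minutes per 1,000 words.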